Abstract: This letter presents a novel approach to extract
reliable, dense, and long-range motion trajectories of an articulated
human in a video sequence. Compared with existing approaches
that emphasize the temporal consistency of each tracked point, we
also consider the spatial structure of the tracked points on the
articulated human. We treat the points as a set of vertices and
build a triangle mesh to join them in image space. The problem
of extracting long-range motion trajectories is thereby recast as
one of keeping the mesh evolution consistent over time. First,
self-occlusion is detected by a novel mesh-based method, and an
adaptive motion estimation method is proposed to initialize the mesh
between successive frames. Furthermore, we propose an iterative
algorithm that efficiently adjusts the mesh vertices toward a
physically plausible deformation satisfying both the local rigidity
of the mesh and the silhouette constraints. Finally, we compare the
proposed method with state-of-the-art methods on a set of challenging
sequences. Evaluations demonstrate that our method achieves
favorable performance in terms of both accuracy and integrity
of the extracted trajectories.

Index Terms: Motion trajectories, articulated motion, mesh
evolution.


I. Introduction 

Long-range motion trajectories provide more precise and
integrated information about a movement and have been
extensively used in applications such as action recognition,
motion segmentation, video indexing and retrieval, and video
manipulation. It is worth noting that most of these applications
use only one camera, which discards much visual information and
brings many challenges. Sparse feature trackers such as the KLT
feature tracker [1] are often used to extract motion trajectories
in a video sequence. Spatially denser trajectories can be obtained
by the PV tracker [2] and the LDOF tracker [3]. The PV tracker
builds trajectories by sweeping forward and backward flow fields
and also refines motion estimates to enforce long-range
consistency. The LDOF tracker is based on the large displacement
optical flow (LDOF) proposed by Brox et al. [4]. These trackers
share one essential criterion: if points are lost, possibly due to
lighting variation, out-of-plane rotation, occlusion, or large
displacement, then new points are added. As a result, points in the
initial video frame may not be fully tracked throughout the video
sequence. Integrated long-range motion trajectories can instead be
obtained by concatenating frame-to-frame optical flow motion
fields, as in the Lagrangian particle trajectories (LPT) used for
action recognition [5]. As discussed in [2], however, this class of
algorithms may cause trajectories to drift through error
accumulation. In summary, it is challenging to extract motion
trajectories that are both reliable and long-range throughout the
whole video sequence.

Fig. 1. Overview of the proposed long-range motion trajectories extraction
method from frame t-1 to t.

Our approach is inspired by the work on dense surface tracking
in [6], [7], both of which formulate a mesh evolution framework
that includes an iterative mesh deformation step. They differ in
that [6] performs surface morphing, while [7] imposes local rigidity
constraints on the surface in the iterative mesh deformation step.
By bringing this mesh evolution framework from 3D space
to the 2D image plane, we extract long-range motion trajectories
effectively. Specifically, self-occlusion is first detected by
searching for mesh intersections. Next, vertices in the occlusion
region and in the non-occlusion region receive region-specific
motion estimates for propagation to the next frame. Last,
vertices gradually approach their actual positions through
the iterative mesh deformation step, in which different
types of drifted vertices are recognized and regularized, and
the local rigidity of the mesh is enforced in an efficient
way. In this letter, binary silhouettes of the articulated human are
utilized to recognize and regularize drifted vertices. The
advantages of using silhouettes have been proven by several
silhouette-based methods [8]-[11] in various applications,
e.g., human action recognition and gait recognition. Silhouettes
are commonly extracted from a video with techniques
such as background subtraction. Fig. 1 shows an overview of
the proposed long-range motion trajectories extraction method.

Our contribution with respect to the methods in [6], [7] is that the
mesh evolution framework is proposed for a monocular-camera
set-up. Self-occlusion of the object is an inevitable problem in
single-view video, so we propose an effective way to detect
the occlusion region. Another problem is that the strategy
of mapping after meshing is not applicable in single-view
video because of self-occlusion, so we propose a strategy of
propagating vertices with specified predicted motions. Moreover,
some geometric information, such as the perspective
invariance of surface normals, does not extend from surface to
silhouette, so we propose an efficient way to recognize and
regularize drifted vertices in 2D. To the best of our knowledge,
no previous work has attempted long-range tracking of an
articulated human undergoing partial self-occlusion
and complicated non-rigid deformations, using silhouettes and
mesh evolution in a single-view video.


II. Proposed Trajectories Extraction Method

The input to our system is a monocular video sequence of
M frames. The stack of silhouettes $\{S^t\}$, $t \in \{1, \dots, M\}$, is
extracted, and N tracked points are sampled uniformly on the
reference silhouette $S^1$ by a mesh generator algorithm [12],
[13]. Let the 2-dimensional vector $p_i^t \in \mathbb{R}^2$ denote the
position of tracked point i in frame t; then a big matrix A is
constructed as follows:

$$A = \begin{bmatrix} p_1^1 & p_1^2 & \cdots & p_1^M \\ p_2^1 & p_2^2 & \cdots & p_2^M \\ \vdots & \vdots & \ddots & \vdots \\ p_N^1 & p_N^2 & \cdots & p_N^M \end{bmatrix} \tag{1}$$

Note that each row of matrix A represents
one fully tracked trajectory $T_i$, $i \in \{1, \dots, N\}$. The objective
of our approach is to extract a reliable set of long-range
trajectories $\{T_i\}$. From an alternate point of view, the N
tracked points physically belong to a human undergoing
articulated motion. Therefore, each column of matrix A is
one instantaneous pose of the articulated human, and all poses are
assumed to share the same topology. We consider a planar triangle
mesh $\mathcal{G}^t(\mathcal{V}, \mathcal{E}, \mathcal{F}, \mathcal{P}^t)$ that represents a column of matrix A,
where $\mathcal{V} = \{1, \dots, N\}$ is the set of vertices,
$\mathcal{E} = \{(i,j),\ i,j \in \mathcal{V}\}$ is the set of edges,
$\mathcal{F} = \{(i,j,k),\ i,j,k \in \mathcal{V}\}$ is the set of faces, and
$\mathcal{P}^t = \{p_1^t, \dots, p_N^t\}$ is the set of vertex positions.
We assume that all meshes $\{\mathcal{G}^t\}$ share the same topology
$(\mathcal{V}, \mathcal{E}, \mathcal{F})$ but vary in their vertex positions $\mathcal{P}^t$. Therefore,
the trajectories extraction problem is cast as mesh evolution over
time, i.e.

$$\mathcal{G}^1(\mathcal{V}, \mathcal{E}, \mathcal{F}, \mathcal{P}^1) \rightarrow \mathcal{G}^t(\mathcal{V}, \mathcal{E}, \mathcal{F}, \mathcal{P}^t) \tag{2}$$

A. Self-Occlusion Detection

Self-occlusion commonly occurs between the moving
torso and the swinging limbs during articulated motion. By
taking advantage of the deformed mesh, we detect the
occlusion region by finding intersected edges of the mesh. As
illustrated in Fig. 2(a), during the leg-crossing motion, two
components of the mesh intersect in the occlusion region, which
is highlighted in red. In computational geometry, this is
a line segment intersection problem, which takes a list of
line segments in the Euclidean plane and asks whether any
two of them intersect. As illustrated in Fig. 2(b), suppose the
two line segments run from $p_1$ to $p_2$ and from $p_3$ to $p_4$. Then
any point on the first line is represented as $p_1 + \alpha(p_2 - p_1)$,
and similarly any point on the second line is $p_3 + \beta(p_4 - p_3)$,
where $\alpha$ and $\beta$ are scalar parameters. The two line segments
intersect if we can find $\alpha$ and $\beta$ such that:

$$p_1 + \alpha(p_2 - p_1) = p_3 + \beta(p_4 - p_3) \tag{3}$$

Cross both sides with $p_4 - p_3$ and $p_2 - p_1$ separately, solving
for $\alpha$ and $\beta$:

$$\alpha = \frac{\|(p_3 - p_1) \times (p_4 - p_3)\|}{\|(p_2 - p_1) \times (p_4 - p_3)\|} \tag{4}$$

$$\beta = \frac{\|(p_1 - p_3) \times (p_2 - p_1)\|}{\|(p_4 - p_3) \times (p_2 - p_1)\|} \tag{5}$$

If the denominator $\|(p_2 - p_1) \times (p_4 - p_3)\| = 0$, then the two
lines are parallel or collinear. Otherwise, if
$\|(p_2 - p_1) \times (p_4 - p_3)\| \neq 0$ as well as $0 < \alpha < 1$ and
$0 < \beta < 1$, then the two lines intersect. In this way, intersected
edges are found in the mesh and the corresponding vertices are
identified as lying in the occlusion region.
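
As an illustration, the edge test in (3)-(5) can be sketched in a few
lines of Python. This is a minimal sketch under stated assumptions, not
the implementation used in this letter: it uses the signed scalar 2D
cross product rather than the norms in (4) and (5), so that the two
parameters keep their sign, and it skips edge pairs that share a vertex
on the assumption that adjacent mesh edges should not count as
occlusion. The names `cross`, `segment_intersection`,
`occluded_vertices`, and the tolerance `eps` are hypothetical.

```python
def cross(u, v):
    """Signed scalar cross product of two 2D vectors."""
    return u[0] * v[1] - u[1] * v[0]


def segment_intersection(p1, p2, p3, p4, eps=1e-12):
    """Return (alpha, beta) if segments p1-p2 and p3-p4 cross, else None."""
    r = (p2[0] - p1[0], p2[1] - p1[1])
    s = (p4[0] - p3[0], p4[1] - p3[1])
    denom = cross(r, s)
    if abs(denom) < eps:
        return None  # parallel or collinear lines
    d = (p3[0] - p1[0], p3[1] - p1[1])
    alpha = cross(d, s) / denom  # signed counterpart of (4)
    beta = cross(d, r) / denom   # signed counterpart of (5)
    if 0 < alpha < 1 and 0 < beta < 1:
        return alpha, beta
    return None


def occluded_vertices(positions, edges):
    """Collect vertices of all edge pairs that intersect (brute force)."""
    hit = set()
    for a in range(len(edges)):
        for b in range(a + 1, len(edges)):
            i, j = edges[a]
            k, l = edges[b]
            if len({i, j, k, l}) < 4:
                continue  # assumption: edges sharing a vertex are ignored
            if segment_intersection(positions[i], positions[j],
                                    positions[k], positions[l]):
                hit.update((i, j, k, l))
    return hit
```

The pairwise loop is quadratic in the number of edges; a sweep-line or
spatial grid would be the natural replacement for larger meshes.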



Fig. 2. An example of detecting self-occlusion in one frame of the Walking
sequence: (a) intersected edges in the occlusion region are colored in red; (b)
illustration of two intersected edges in the mesh. 


B. Initial Motion Estimation 

In order to propagate mesh $\mathcal{G}^{t-1}$ to $\mathcal{G}^t$ in the next frame
with a reliable initial guess, we propose to estimate the vertices
of $\mathcal{G}^t$ through large displacement optical flow (LDOF) [4],
polynomial curve fitting, and patch-based average filtering.
LDOF is a recent, successful optical flow method that specifically
addresses the estimation of articulated human motion. However,
like other optical flow methods, it does not solve the occlusion
problem. Therefore, an adaptive method is proposed to estimate the
motion vectors of the vertices of $\mathcal{G}^{t-1}$ in different image
regions. For a vertex $p_i^{t-1}$ in the non-occlusion region, we
perform bicubic spline interpolation of the LDOF motion vectors
to get the motion vector $u_i^{t-1}$. For a vertex in the
occlusion region, we perform a second-order polynomial curve
fitting to construct vertex $p_i^t$ from the discrete
set of its previous five positions. Specifically, the fitting model
is $Y_i = B X_t$, where

$$B = \begin{bmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{bmatrix}$$

is the unknown coefficients matrix, and $X_t$ and $Y_i$ are the input
and output matrices respectively, i.e.
$X_t = [x_{t-1}\ x_{t-2}\ x_{t-3}\ x_{t-4}\ x_{t-5}]$, $x_t = [t^2\ t\ 1]^T$,
$Y_i = [p_i^{t-1}\ p_i^{t-2}\ p_i^{t-3}\ p_i^{t-4}\ p_i^{t-5}]$. Therefore, the
solution of the coefficients matrix is $B = Y_i X_t^{\dagger}$, where
$X_t^{\dagger}$ denotes the pseudoinverse of $X_t$, and the estimated
motion vector is

$$u_i^{t-1} = B x_t - p_i^{t-1} \tag{6}$$
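
The polynomial prediction for an occluded vertex can be illustrated
with a short, dependency-free Python sketch. It is only an
illustration under assumptions: the quadratic is fitted per coordinate
by solving the 3 x 3 normal equations, which gives the same
least-squares solution as the pseudoinverse, and time is measured as an
offset from frame t (the previous frames sit at -1, ..., -5 and the
prediction is read at 0), which leaves the fitted curve unchanged. The
function names `solve3`, `predict_from_history`, and
`occluded_motion_vector` are hypothetical.

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [list(row) + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def predict_from_history(history):
    """Predict the position at frame t from previous positions.

    history lists positions most recent first: [p(t-1), p(t-2), ...].
    """
    taus = [-(k + 1) for k in range(len(history))]  # time offsets from t
    basis = [(tau * tau, tau, 1.0) for tau in taus]
    gram = [[sum(b[r] * b[c] for b in basis) for c in range(3)]
            for r in range(3)]
    predicted = []
    for axis in (0, 1):
        rhs = [sum(b[r] * p[axis] for b, p in zip(basis, history))
               for r in range(3)]
        coeffs = solve3(gram, rhs)
        predicted.append(coeffs[2])  # polynomial value at offset 0
    return tuple(predicted)


def occluded_motion_vector(history):
    """Motion vector as in (6): predicted position minus p(t-1)."""
    px, py = predict_from_history(history)
    return (px - history[0][0], py - history[0][1])
```

For example, five samples of the curve (s^2 + 2s + 3, -s + 1) at
offsets s = -1, ..., -5 yield a predicted position close to (3, 1).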

Moreover, in order to handle observation noise, we apply
a patch-based average filter to smooth the motion vectors.
Here, a patch is the set consisting of vertex i and
its adjacent vertices, i.e. $N(i) = \{i\} \cup \{j : (i,j) \in \mathcal{E}\}$, and
$|N(i)|$ is the number of vertices in patch $N(i)$. Specifically, the
proposed motion estimation method is defined as

$$p^t_{i\cdot Initial} = p_i^{t-1} + \frac{1}{|N(i)|} \sum_{j \in N(i)} u_j^{t-1} \tag{7}$$

Fig. 3. Illustration of the mesh regularization process: (a) initial mesh and the
silhouette; (b) vertex density map; (c) regularization of the first type of drifted
vertices; (d) regularization of the second type of drifted vertices; (e) and (f)
displacement vectors of the regularized vertices.

C. Iterative Mesh Deformation 

The previous step provides a reasonable initialization of the
vertex positions at frame t by taking the self-occlusion
problem into account. Further refinement is necessary to solve
the drift problem, which can be caused by non-rigid motion,
large displacement, variations in appearance and lighting, and
interference from ambiguous textures. An iterative solution
combining mesh regularization and rigid mesh deformation is
proposed to obtain the optimal estimate $p_i^t(k)$, initialized
with $p_i^t(0) = p^t_{i\cdot Initial}$, where k is the iteration number. We
then define the energy function as follows:

$$\sum_{i=1}^{N} \cdots \tag{8}$$

To reduce the effect of noise and of the varying value
range of the data, the energy function is first normalized by
linear normalization and then fitted by the power function
($y = a x^b$). We then define the iteration stopping criterion on
the fitted energy function $f$ as follows ($\theta$ is set to 0.003 in our
experiments):

$$f(k) - f(k-1) < \theta \tag{9}$$
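
One possible reading of this stopping rule is sketched below in
Python. Several details are assumptions, since the text does not fix
them: the linear normalization is taken to be division by the largest
energy value, the power function is fitted by least squares in log-log
space with iterations numbered from 1, all energies are assumed
positive, and the magnitude of the change is compared with the
threshold so that the test behaves the same for increasing and
decreasing curves. The names `fit_power_law` and `should_stop` are
hypothetical.

```python
import math


def fit_power_law(values):
    """Least-squares fit of y = a * x**b in log-log space, x = 1..len(values)."""
    xs = [math.log(k) for k in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def should_stop(energies, theta=0.003):
    """Decide whether to stop after iteration k = len(energies)."""
    if len(energies) < 3:
        return False  # assumption: too few points for a meaningful fit
    peak = max(energies)
    normalized = [e / peak for e in energies]  # assumed normalization
    a, b = fit_power_law(normalized)
    k = len(energies)
    change = a * k ** b - a * (k - 1) ** b
    return abs(change) < theta  # assumption: magnitude of the change
```

Fitting a smooth curve before thresholding keeps a single noisy
iteration from triggering an early stop.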


1) Mesh Regularization: When vertices drift away from
their actual positions, the constructed mesh no longer meets
the silhouette constraint. Typically, there are two types of
drifted vertices: the first type occurs when vertices do not reach
their actual positions, which leaves blank areas in the silhouette;
the second type occurs when vertices lie beyond the range of
the silhouette, as shown in Fig. 3(a). To predict the target
position, drifted vertices are gradually regularized toward the
blank of the silhouette and away from the non-silhouette area. First,
we compute the vertex density map, which measures the number
of vertices per unit area (within a radius equal to the longest
edge of the reference mesh $\mathcal{G}^1$), as shown in Fig. 3(b). By applying
a threshold, the blank of the silhouette is simply labeled and
expressed as a set of pixel points $Q = \{q_1, q_2, \dots\}$, shown
as the black region in Fig. 3(c). If a subset $Q_i \subset Q$ is within
the unit area of a vertex i, we label vertex i as a drifted vertex
of the first type ($\mathcal{V}_1$) and predict its target
position from the pixel points in $Q_i$. As shown in Fig. 3(d),
if a vertex is beyond the range of the silhouette, we label it as
a drifted vertex of the second type ($\mathcal{V}_2$) and predict its target
position from its supporting adjacent vertices, denoted
$N_i = N(i) \cap (\mathcal{V} \setminus \mathcal{V}_2)$. Note that a potential issue
occurs when all vertices of a patch are of the second type,
that is, $N(i) \subset \mathcal{V}_2$ and $N_i = \emptyset$. Therefore,
we predict the target positions of the second type of drifted
vertices in a batch process: each predicted batch of vertices
is removed from $\mathcal{V}_2$, and prediction continues on the remaining
vertices until $\mathcal{V}_2$ is empty. We finally regularize the target
position as follows:

$$p^t_{i\cdot Reg}(k) = \begin{cases} \lambda p_i^t(k-1) + (1-\lambda) \dfrac{1}{|Q_i|} \sum_{q_j \in Q_i} q_j & \text{if } i \in \mathcal{V}_1 \\ \lambda p_i^t(k-1) + (1-\lambda) \dfrac{1}{|N_i|} \sum_{j \in N_i} p_j^t(k-1) & \text{if } i \in \mathcal{V}_2 \\ p_i^t(k-1) & \text{else} \end{cases} \tag{10}$$

Here, $|Q_i|$ and $|N_i|$ are the numbers of elements of the sets $Q_i$
and $N_i$ respectively. The $\lambda$ term balances the influence of the
original point against that of the points in the support domain and
thus controls the regularization pace. In practice, $\lambda = 2/3$ was
used for all experiments. Fig. 3(e) and 3(f) show the results of
mesh regularization.
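
A compact Python sketch of the update in (10), including the batch
handling of second-type vertices, is given below. It is illustrative
only and rests on assumptions: the two drifted sets are taken as
already identified and disjoint, vertices released from the second set
in an earlier batch contribute their regularized positions as support,
and the loop gives up if no remaining vertex has any support. The
names `regularize`, `blend`, and `mean_point` are hypothetical.

```python
def mean_point(points):
    """Centroid of a non-empty list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def blend(p, target, lam):
    """lam * p + (1 - lam) * target, as in the first two cases of (10)."""
    return (lam * p[0] + (1 - lam) * target[0],
            lam * p[1] + (1 - lam) * target[1])


def regularize(positions, patches, blank_pixels, outside, lam=2 / 3):
    """One regularization pass over all vertices.

    positions:    list of (x, y) from iteration k-1
    patches:      list of sets, patches[i] = N(i) (includes i itself)
    blank_pixels: dict i -> list of pixel points Q_i (first type)
    outside:      set of vertex indices of the second type
    """
    new = list(positions)  # untouched vertices keep their position
    for i, pixels in blank_pixels.items():
        new[i] = blend(positions[i], mean_point(pixels), lam)

    remaining = set(outside)
    while remaining:
        batch = {}
        for i in remaining:
            support = [j for j in patches[i] if j not in remaining]
            if support:
                target = mean_point([new[j] for j in support])
                batch[i] = blend(positions[i], target, lam)
        if not batch:
            break  # assumption: stop if nothing can be predicted
        for i, p in batch.items():
            new[i] = p
        remaining -= set(batch)
    return new
```

With the default weight of 2/3, each pass moves a drifted vertex one
third of the way toward its predicted target.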

2) Local Rigid Deformation: To preserve the local rigidity
of the deformed mesh, we map the patches to a global
coordinate system via per-patch rigid transformations; here the
rigid transformation is equivalent to an affine transformation
in the 2D image plane. As described in (2), we would
like to compute the rigid transformation of a reference patch
in $\mathcal{P}^1$ that best conforms it to the corresponding patch in
$\mathcal{P}^t$, such that:

$$(R_i, T_i) = \arg\min \sum_{j \in N(i)} \| p^t_{j\cdot Reg}(k) - (R_i p_j^1 + T_i) \|^2 \tag{11}$$

where $R_i$ is the $2 \times 2$ rigid transformation matrix and $T_i$ is the
translation vector. This is an instance of the Procrustes problem,
which can be solved by Procrustes analysis [14]. Instead
of simply using the rigid transformation of patch $N(i)$, we
also consider the rigid transformations of the neighboring
patches $\{N(j)\}$, $j \in N(i)$. This procedure better preserves the
local rigidity of the mesh deformation. The vertex position is
defined as

$$p^t_{i\cdot RD}(k) = \frac{1}{|N(i)|} \sum_{j \in N(i)} (R_j p_i^1 + T_j) \tag{12}$$

After the mesh regularization and the local rigid deformation of
the mesh, one iteration ends and the next iteration begins with the
updated position, i.e. $p_i^t(k) = p^t_{i\cdot RD}(k)$. The iteration stops
when the stopping criterion in (9) is satisfied.
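
The per-patch fit in (11) and the averaging in (12) can be illustrated
with the following Python sketch. It is a minimal sketch under
assumptions: the transformation is restricted to a rotation plus a
translation (no reflection or scaling), for which the 2D least-squares
solution has a closed form through the rotation angle, and the names
`fit_rigid_2d` and `rigid_deform` are hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation and translation mapping src onto dst.

    Returns ((cos, sin), (tx, ty)) so that dst ~ R * src + T.
    """
    n = len(src)
    cs = (sum(p[0] for p in src) / n, sum(p[1] for p in src) / n)
    cd = (sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n)
    dot = crs = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        sx, sy = sx - cs[0], sy - cs[1]
        dx, dy = dx - cd[0], dy - cd[1]
        dot += sx * dx + sy * dy
        crs += sx * dy - sy * dx
    theta = math.atan2(crs, dot)  # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    tx = cd[0] - (c * cs[0] - s * cs[1])
    ty = cd[1] - (s * cs[0] + c * cs[1])
    return (c, s), (tx, ty)


def rigid_deform(reference, regularized, patches):
    """Average, for each vertex, the transforms of its neighboring patches."""
    transforms = []
    for patch in patches:
        order = sorted(patch)
        transforms.append(fit_rigid_2d([reference[j] for j in order],
                                       [regularized[j] for j in order]))
    deformed = []
    for i, (x, y) in enumerate(reference):
        px = py = 0.0
        for j in patches[i]:
            (c, s), (tx, ty) = transforms[j]
            px += c * x - s * y + tx
            py += s * x + c * y + ty
        size = len(patches[i])
        deformed.append((px / size, py / size))
    return deformed
```

When the regularized mesh is an exact rigid copy of the reference
mesh, every patch recovers the same transform and the output
coincides with the regularized positions.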

III. Experiments 
A. Datasets and Baselines 

To evaluate the efficiency of the proposed method, five
challenging sequences from [15], [16] and the Weizmann Human
Action Dataset [17] are used. The challenges of these videos
include pose change, self-occlusion, rapid movement, and
scale variation. Our method is also compared with
state-of-the-art motion trajectories extraction algorithms, including
the KLT tracker [1], PV tracker [2], LDOF tracker [3], and LPT
[5]. Their source codes are provided by the authors, and the
parameters are tuned to achieve the best results.

Fig. 4. Results of the proposed method on five sequences: (a) Walk, (b) Wheel,
(c) Handstand, (d) Dance, (e) Skirt. The body parts are best viewed in color.

B. Long-Range Motion Trajectories Extraction 

Fig. 4 illustrates representative frames of the five sequences and the
extracted long-range motion trajectories (the longest motion
trajectory in time is 141 frames, from the Skirt sequence).
Each sequence has its own characteristics. In the sequence
Walking, slight foreshortening and self-occlusion occur
when the woman moves her left leg diagonally backward,
followed by her right leg. The sequences Wheeling
and Handstanding record a complete wheeling action
and a complete handstand action respectively; fast movement and
out-of-plane rotation are the main challenges. The sequence
Dancing contains complex pose change, foreshortening, and
self-occlusion. In the sequence Skirt, the woman moves forward
with her arms lifted and then turns sideways, undergoing
scale variation and out-of-plane rotation. The proposed method
achieved robust performance over these challenging sequences.
We also tested our approach on the Weizmann Human Action
Dataset [17], and some of the results are shown in Fig.
7. The visual results can be found on our project website
http://videoprocessing.ucsd.edu/~yuanyuan/trajectores.html.

C. Performance Comparison 

To evaluate the accuracy of the motion trajectories extracted
by the state-of-the-art methods and by the proposed method,
we present visual comparisons in Fig. 5, where self-occlusion
and fast movement occur in the sequences Walking
and Wheeling. It is observed that, with the other four methods,
an abundance of points on the leg drift away or stop being
tracked due to self-occlusion and fast movement,
while the proposed method tracks dense points accurately.

In this letter, the percentage of tracking length in time
is computed to evaluate the integrity of the extracted motion
trajectories. From Table I we can observe that the average
percentages of tracking length in time for the KLT, PV, and LDOF
algorithms are less than 100%, which means that these algorithms
cannot continually track dense points throughout all five
sequences. In contrast, integrated trajectories are obtained by
LPT and the proposed method.
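
The integrity measure can be made concrete with a short Python
example. The exact definition is not spelled out in the text, so this
sketch assumes that each trajectory contributes the fraction of frames
over which it was tracked and that the reported figure is the mean of
these fractions in percent; `average_tracking_percentage` is a
hypothetical name.

```python
def average_tracking_percentage(track_lengths, num_frames):
    """Mean share of the sequence covered by each trajectory, in percent.

    track_lengths: number of frames each trajectory was tracked
    num_frames:    total number of frames in the sequence
    """
    if not track_lengths or num_frames <= 0:
        raise ValueError("need at least one trajectory and one frame")
    return 100.0 * sum(track_lengths) / (len(track_lengths) * num_frames)
```

Under this reading, a value of 100 means that every point sampled in
the first frame was followed to the last frame.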

To further evaluate the accuracy of the integrated motion
trajectories extracted by LPT and the proposed method, we
compute the tracking error based on the provided benchmarks
of joint center positions in every frame [15], [16]. Fig. 6
presents the standard deviation of the offset distances in every
frame of the five sequences. It is observed that the proposed
method outperforms LPT, with a smaller standard
deviation of offset distances. It is worth pointing out that taking
advantage of silhouettes may be the main reason that
the proposed method is superior to LPT. Silhouette constraints
play an important role in recognizing and regularizing drifted
vertices, thereby avoiding the accumulation of errors during
tracking.

Fig. 5. Sub-trajectories of KLT, PV, LDOF, LPT and the proposed method on
two challenging sequences: (a) Walk, (b) Wheel.

TABLE I. The average percentage of tracking length in time.

Video       KLT(%)   PV(%)   LDOF(%)   LPT(%)   Proposed(%)
Walk        57.4     67.4    61.6      100      100
Wheel       35.1     18.9    23.1      100      100
Handstand   42.1     34.8    21.4      100      100
Dance       83.0     43.8    34.0      100      100
Skirt       99       79.1    27.5      100      100



Fig. 6. The standard deviation of offset distances from extracted joint center
positions to benchmarks in every frame of the five sequences: (a) Walk,
(b) Wheel, (c) Handstand, (d) Dance, (e) Skirt.

Fig. 7. Results of the proposed method on the Weizmann Human Action Dataset:
(a) Wave1, (b) Jack, (c) Run, (d) Jump. The body parts are best viewed in color.

IV. Conclusion 

This letter presents a novel, effective, and reliable long-range
motion trajectories extraction method based on mesh evolution
and silhouette constraints. Experiments on challenging video
sequences show that the proposed method guarantees the
integrity and accuracy of dense point tracking and performs
better than several state-of-the-art methods. Since the proposed
method handles partial occlusion but not full occlusion, it
is limited on some challenging actions such as spinning around and
severe shape deformation. The proposed method is suitable for
applications where the accuracy of motion estimation is vital.